0.1 Dimensionality Reduction

0.1.1 PCA

PCA: an orthogonal linear transformation which projects the genotype data into the new reduced dimensional space such that the greater variance comes in an order

0.1.2 UMAP

UMAP: a novel non-linear dimensionality reduction technique based on Riemannian geometry and algebraic topology to model and preserve the high-dimensional topology of data points in the lowdimensional space

PCA–UMAP: an application of UMAP for principal components of genotype data to be computationally more advantageous and statistically less noisy

UMAP has several hyperparameters that can have a signficant impact on the resulting embedding.

  • n_neighbors: The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100 (default = 15).
    • Using few neighbours to build a manifold emphasizes closer relatedness, with related samples clustered togeather.
  • min_dist: The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. The value should be set relative to the “spread“ value. (default = 0.1)
    • spread can therefore be used to control the inter-cluster distances to some extent, where as min_dist controls the size of the clusters.
  • n_components: The dimension of the space to embed into (defaults = 2)
  • metric: The metric to use to compute distances in high dimensional space (default = euclidean)
  • n pca: For genetic ancestry the number of PCs used in the data input. Adding more components results in progressively finer clusters until approximately 20 populations appear using 15 components; adding further components converges to results similar to using the entire genotype data

0.1.3 References

0.1.3.1 Papers

  • Diaz-Papkovich, A., et al. (2019). UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLOS Genetics 15(11), e1008432
    • data = 1kg, UKBB, HRS; metric = Eucledian, PCs = 10; n_neigbours = 15, min_dist = 0.5
    • Parameter Tuning
      • PCs: 2 - 15
      • min_dist: 0.001 - 0.5
      • n_neigbours: 5 - 50
  • Karczewski, K., et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581(7809), 434-443
    • data = gnomAD; PCs = 7; min_dist = 0.5; n_neigbours = 30
  • Sakaue, S., et al. (2020). Dimensionality reduction reveals fine-scale structure in the Japanese population with consequences for polygenic risk prediction. Nature Communications 11(1), 1569
    • data = Japanse Biobank; PCs = 50; min_dist = 0.1; n_neigbours = 15;
  • Yamamoto, K.,et al. (2020). Genetic and phenotypic landscape of the mitochondrial genome in the Japanese population. Communications Biology 3(1), 104
    • data = Japanese pop; PCs = 20; n_neighbors = 100; min_dist = 0.99

0.1.3.2 Documentation

  • umapr: wraps the Python implementation of UMAP to make the algorithm accessible from within R
  • umap: R package that provides an interface to the UMAP algorithm, including a translation of the original algorithm into R with minimal dependencies
  • uwot: R package implementing the UMAP dimensionality reduction method that also implements the supervised and metric (out-of-sample) learning extensions to the basic method.
  • umap: Read the docs for the python implementation
parameter umapr umap uwot
include_input TRUE NA NA
n_neighbors 15 15 15
n_components 2 2 2
metric euclidean euclidean euclidean
n_epochs NULL 200 NULL
learning_rate 1 NA 1
alpha 1 1 NA
init spectral spectral spectral
spread 1 1 1
min_dist 0.1 0.1 0.01
set_op_mix_ratio 1 1 1
local_connectivity 1 1 1
repulsion_strength 1 NA 1
bandwidth 1 1 1
gamma 1 1 NA
negative_sample_rate 5 5 5
transform_queue_size 4 NA NA
a NULL NULL NULL
b NULL NULL NULL
random_state NULL NULL NA
metric_kwds dict() NA NA
angular_rp_forest FALSE NA NA
target_n_neighbors -1 NA n_neighbors
target_metric categorical NA euclidean
target_metric_kwds dict() NA NA
target_weight 0.5 NA 0.5
transform_seed 42 NA NA

0.2 mtReference Panel

  • Read in map/ped files
  • assign mitochondrial haplogroups using Hi-Mc
  • Convert alleles to 0/1 coding
## Provenence 
country <- read_tsv("/Users/sheaandrews/GitCode/MitoImputePrep/DerivedData/ReferencePanel_v6/ReferencePanel_v1_highQual_MAF0.01_filtered_countryInfo.ped") %>% 
  select(Family, Individual, geographic_data, Continent, Country, Region)

## Haplogroup assignments 
dat <- generate_snp_data_fixed("/Users/sheaandrews/GitCode/MitoImpute/ReferencePanel/ReferencePanel.map",
                               "/Users/sheaandrews/GitCode/MitoImpute/ReferencePanel/ReferencePanel.ped")

data(nodes)
classifications.raw <- HiMC::getClassifications(dat)
classifications <- as_tibble(classifications.raw[-nrow(classifications.raw),]) #remove Mitochondria_Information row 

ped <- dat %>% 
  magrittr::set_colnames(
    c("Family", "Individual", "Father", "Mother", "Sex", "Phenotype", 
      paste0('mt', colnames(magrittr::extract(., 7:ncol(.))))
      )
    ) %>% 
  as_tibble() %>%
  na_if(., "0 0") %>% 
  mutate_at(vars(starts_with("mt")), as_factor) %>% 
  mutate_at(vars(starts_with("mt")), as.numeric) %>% 
  left_join(classifications) %>% 
  left_join(country) %>% 
  mutate(macro = ifelse(str_detect(haplogroup, "^L"),
                        substr(haplogroup, start = 1, stop = 2),
                        substr(haplogroup, start = 1, stop = 1)), 
         Continent = replace_na(Continent, "Unknown"),         
         supclu = case_when(
            macro %in% c("L0", "L1", "L2", "L3") ~ "L",
            macro %in% c("Q", "G", "E", "D", "C", "Z", "M") ~ "M",
            macro %in% c("A", "S", "O", "Y", "I", "W", "X", "N") ~ "N",
            macro %in% c("R", "P", "B", "F", "J", "T", "U", "K", "H", "V") ~ "R"
  )) %>% 
  select(Family, Individual, Father, Mother, Sex, Phenotype, haplogroup, macro, supclu,
         geographic_data, Continent, Country, Region, everything()) %>% 
  filter(macro %nin% c("u")) %>% 
  filter(!is.na(macro))

0.2.1 PCA analysis

Run PCA analysis of mtReference Panel

  • extract results for individules
  • plot 2D and 3D results
## PCA 
res.pca <- ped %>% 
  select(starts_with("mt")) %>% 
  PCA(., scale.unit = F, ncp = 10, graph = FALSE)

ind.pca <- get_pca_ind(res.pca)$coord %>% 
  as_tibble() %>%
  bind_cols(select(ped, Individual, haplogroup, macro, supclu, Continent)) %>% 
  select(Individual, haplogroup, macro, Continent, supclu, everything()) 
eig.val <- get_eigenvalue(res.pca)
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

ggplot(ind.pca, aes(x = Dim.1, y = Dim.2, colour = macro)) + 
  scale_color_manual(values = cols25()) + 
  geom_point() + theme_bw()

ggplot(ind.pca, aes(x = Dim.1, y = Dim.2, colour = supclu)) + 
  scale_color_manual(values = cols25()) + 
  geom_point() + theme_bw()

ggplot() + 
  geom_point(data = filter(ind.pca, Continent == "Unknown"), aes(x = Dim.1, y = Dim.2, colour = Continent)) + 
  geom_point(data = filter(ind.pca, Continent != "Unknown"), aes(x = Dim.1, y = Dim.2, colour = Continent)) + 
  scale_color_manual(values = cols25()) + 
  theme_bw()

plot_ly(ind.pca, x = ~Dim.2, y = ~Dim.1, z = ~Dim.3, color = ~macro, colors = cols25(),
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(Dim.1, 2),
                                          '</br> PC2: ', round(Dim.2, 2),
                                          '</br> PC3: ', round(Dim.3, 2)))
plot_ly(ind.pca, x = ~Dim.2, y = ~Dim.1, z = ~Dim.3, color = ~supclu, colors = cols25(),
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(Dim.1, 2),
                                          '</br> PC2: ', round(Dim.2, 2),
                                          '</br> PC3: ', round(Dim.3, 2)))
plot_ly(ind.pca, x = ~Dim.2, y = ~Dim.1, z = ~Dim.3, color = ~Continent, colors = cols25(),
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(Dim.1, 2),
                                          '</br> PC2: ', round(Dim.2, 2),
                                          '</br> PC3: ', round(Dim.3, 2)))

0.2.2 UMAP

embedding_mtref_2d <- select(ind.pca, starts_with("Dim")) %>% 
  uwot::umap(., min_dist = 0.5, n_neighbor = 100, n_components = 2) %>% 
  as.tibble() %>%
  bind_cols(select(ped, Individual, haplogroup, macro, supclu, Continent)) %>%
  rename(UMAP1 = V1, UMAP2 = V2) 
  

ggpubr::ggarrange(
ggplot(embedding_mtref_2d, aes(x = UMAP1, y = UMAP2, color = macro)) + 
  geom_point() + 
  scale_color_manual(values = cols25()) +
  theme_bw(),

ggplot(embedding_mtref_2d, aes(x = UMAP1, y = UMAP2, color = supclu)) + 
  geom_point() + 
  scale_color_manual(values = cols25()) +
  theme_bw(),

ggplot() + 
  geom_point(data = filter(embedding_mtref_2d, Continent == "Unknown"), aes(x = UMAP1, y = UMAP2, colour = Continent)) + 
  geom_point(data = filter(embedding_mtref_2d, Continent != "Unknown"), aes(x = UMAP1, y = UMAP2, colour = Continent)) + 
  scale_color_manual(values = cols25()) +
  theme_bw()
)

embedding_mtref_3d <- select(ind.pca, starts_with("Dim")) %>% 
  uwot::umap(., min_dist = 0.5, n_neighbor = 100, n_components = 3)  %>% as.tibble() %>%
  bind_cols(select(ped, Individual, haplogroup, macro, supclu, Continent)) %>%
  rename(UMAP1 = V1, UMAP2 = V2, UMAP3 = V3)

plot_ly(embedding_mtref_3d, x = ~UMAP2, y = ~UMAP1, z = ~UMAP3, color = ~macro, colors = cols25(), type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(UMAP1, 2),
                                          '</br> PC2: ', round(UMAP2, 2),
                                          '</br> PC3: ', round(UMAP3, 2)))
plot_ly(embedding_mtref_3d, x = ~UMAP2, y = ~UMAP1, z = ~UMAP3, color = ~supclu, colors = cols25(), type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(UMAP1, 2),
                                          '</br> PC2: ', round(UMAP2, 2),
                                          '</br> PC3: ', round(UMAP3, 2)))
plot_ly(filter(embedding_mtref_3d, Continent != "Unknown"), x = ~UMAP2, y = ~UMAP1, z = ~UMAP3, color = ~Continent, colors = cols25(), type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(UMAP1, 2),
                                          '</br> PC2: ', round(UMAP2, 2),
                                          '</br> PC3: ', round(UMAP3, 2)))

Use tricolore to color individules by where they fall on the first three PCs from the PCA analysis in UMAP

## tricolore 
# color-code the data set and generate a color-key
embedding_mtref_tri <- embedding_mtref_2d %>% 
  left_join(select(ind.pca, Individual, starts_with("Dim"))) %>% 
  # add constant to make all values to postive for tricolore
  mutate(Dim.1 = Dim.1 + (min(Dim.1) * -1), 
         Dim.2 = Dim.2 + (min(Dim.2) * -1), 
         Dim.3 = Dim.3 + (min(Dim.3) * -1),) 

tri_map <- tricolore::Tricolore(embedding_mtref_tri, p1 = 'Dim.1', p2 = 'Dim.2', p3 = 'Dim.3', breaks = 10) 

embedding_mtref_tri <- mutate(embedding_mtref_tri, rgb = tri_map %>% magrittr::use_series(rgb))

ggplot(embedding_mtref_tri, aes(x = UMAP1, y = UMAP2, color = rgb)) + 
  geom_point() +  theme_bw() + 
  theme(legend.position = "none") + 
  labs(title = "PCA+UMAP: Ternery")

tri_map$key

0.3 Comparison of UMAP packages

min_dist = 0.5
n_neighbor = 100
n_components = 2
n_epochs = 200
init = "random"

mtref_umapr_tab <- select(ind.pca, starts_with("Dim")) %>% 
  umapr::umap(., min_dist = min_dist, n_neighbor = n_neighbor, n_components = n_components, n_epochs = NULL, init = init) %>%
  as.tibble() %>%
  bind_cols(select(ped, Individual, haplogroup, macro, Continent))
## [1] "umap-learn already installed"
p1 <- ggplot(mtref_umapr_tab, aes(x = UMAP1, y = UMAP2, color = macro)) + 
  geom_point() + 
    scale_color_manual(values = cols25()) +
    theme_bw() + 
  labs(title = "UMAPR")

mtref_umap_res <- select(ind.pca, starts_with("Dim")) %>% 
  umap::umap(., min_dist = min_dist, n_neighbor = n_neighbor, n_components = n_components, n_epochs = n_epochs, init = init)  

mtref_umap_tab <- mtref_umap_res %>% magrittr::use_series(layout) %>% 
  as.tibble() %>%
  bind_cols(select(ped, Individual, haplogroup, macro, Continent)) 

p2 <- ggplot(mtref_umap_tab, aes(x = V1, y = V2, color = macro)) + 
  geom_point() + 
    scale_color_manual(values = cols25()) +
    theme_bw() + 
  labs(title = "UMAP")

mtref_uwot_tab <- select(ind.pca, starts_with("Dim")) %>% 
  uwot::umap(., min_dist = min_dist, n_neighbor = n_neighbor, n_components = n_components, n_epochs = n_epochs, init = init) %>% 
  as.tibble() %>%
  bind_cols(select(ped, Individual, haplogroup, macro, Continent))

p3 <- ggplot(mtref_uwot_tab, aes(x = V1, y = V2, color = macro)) + 
  geom_point() + 
    scale_color_manual(values = cols25()) +
    theme_bw() + 
  labs(title = "UWOT")

ggpubr::ggarrange(
  p1, p2, p3,  ncol = 3, legend = "bottom", common.legend = TRUE
)

0.4 Thousand Genomes

0.4.1 Data Wrangling

## Haplogroup assignments 
dat_1kg <- generate_snp_data_fixed(
  "/Users/sheaandrews/GitCode/MitoImputePrep/DerivedData/ThousandGenomes/chrMT_1kg_norm_decomposed_firstAlt.map",
  "/Users/sheaandrews/GitCode/MitoImputePrep/DerivedData/ThousandGenomes/chrMT_1kg_norm_decomposed_firstAlt.ped")
info_1kg <- read_tsv("/Users/sheaandrews/GitCode/MitoImputePrep/data/ThousandGenomes/20130606_g1k.ped", guess_max = 10000) %>% 
  select(Individual = `Individual ID`, Family_ID = `Family ID`, Father = `Paternal ID`, Mother = `Maternal ID`, Population, Gender) %>% 
  mutate(cont = case_when(
    Population == "ESN" ~ "AFR",
    Population == "GWD" ~ "AFR",
    Population == "LWK" ~ "AFR",
    Population == "MSL" ~ "AFR",
    Population == "YRI" ~ "AFR",
    Population == "JPT" ~ "EAS",
    Population == "CHB" ~ "EAS",
    Population == "CDX" ~ "EAS",
    Population == "CHS" ~ "EAS",
    Population == "KHV" ~ "EAS",
    Population == "CEU" ~ "EUR",
    Population == "FIN" ~ "EUR",
    Population == "GBR" ~ "EUR",
    Population == "IBS" ~ "EUR",
    Population == "TSI" ~ "EUR",
    Population == "BEB" ~ "IND",
    Population == "GIH" ~ "IND",
    Population == "ITU" ~ "IND",
    Population == "PJL" ~ "IND",
    Population == "STU" ~ "IND",
    Population == "ACB" ~ "AMR",
    Population == "ASW" ~ "AMR",
    Population == "CLM" ~ "AMR",
    Population == "MXL" ~ "AMR",
    Population == "PEL" ~ "AMR",
    Population == "PUR" ~ "AMR",
  ))

data(nodes)
classifications_1kg.raw <- HiMC::getClassifications(dat_1kg)
classifications_1kg <- as_tibble(classifications_1kg.raw[-nrow(classifications_1kg.raw),]) #remove Mitochondria_Information row 

ped_1kg <- dat_1kg %>% 
  magrittr::set_colnames(
    c("Family", "Individual", "Father", "Mother", "Sex", "Phenotype", 
      paste0('mt', colnames(magrittr::extract(., 7:ncol(.))))
      )
    ) %>% 
  as_tibble() %>%
  na_if(., "0 0") %>% 
  mutate_at(vars(starts_with("mt")), as_factor) %>% 
  mutate_at(vars(starts_with("mt")), as.numeric) %>% 
  left_join(classifications_1kg) %>% 
  mutate(macro = ifelse(str_detect(haplogroup, "^L"),
                        substr(haplogroup, start = 1, stop = 2),
                        substr(haplogroup, start = 1, stop = 1)), 
         supclu = case_when(
    macro %in% c("L0", "L1", "L2", "L3") ~ "L",
    macro %in% c("Q", "G", "E", "D", "C", "Z", "M") ~ "M",
    macro %in% c("A", "S", "O", "Y", "I", "W", "X", "N") ~ "N",
    macro %in% c("R", "P", "B", "F", "J", "T", "U", "K", "H", "V") ~ "R"
  )) %>% 
  select(-Family, -Sex, -Father, -Mother) %>%
  left_join(info_1kg, by = "Individual") %>%
  select(Family = Family_ID, Individual, Father, Mother, Sex = Gender, Phenotype, 
         pop = Population, haplogroup, macro, supclu, full_path, everything()) %>% 
  filter(macro %nin% c("u")) %>% 
  filter(!is.na(macro)) # %>% 
  # group_by(Family) %>% 
  # slice(1) %>% 
  # ungroup()
#  filter(macro %nin% c("L0", "L1", "L2", "L3"))

miss_gt <- ped_1kg %>%
  select(everything()) %>%  # replace to your needs
  summarise_all( ~ sum(is.na(.))) %>% 
  t() %>% 
  as_tibble(rownames = "var") %>% 
  filter(V1 > 0) %>% pull(var)
count(ped_1kg, pop) %>% print(n = Inf)
count(ped_1kg, cont) %>% print(n = Inf)
count(ped_1kg, supclu) %>% print(n = Inf)
count(ped_1kg, macro) %>% print(n = Inf)
count(ped_1kg, haplogroup) %>% print(n = Inf)

0.4.2 PCA

## PCA 
res.pca_1kg <- ped_1kg %>% 
  select(starts_with("mt")) %>% 
  select(-all_of(miss_gt)) %>%
#  mutate_all(., .funs = function(x){ifelse(is.na(x), round(mean(x, na.rm = TRUE)), x)}) %>%
  PCA(., scale.unit = F, ncp = 10, graph = FALSE)

ind.pca_1kg <- get_pca_ind(res.pca_1kg)$coord %>% 
  as_tibble() %>%
  bind_cols(select(ped_1kg, Individual, haplogroup, macro, supclu, pop, cont)) %>% 
  select(Individual, haplogroup, macro, supclu, everything()) 
eig.val_1kg <- get_eigenvalue(res.pca_1kg)
fviz_eig(res.pca_1kg, addlabels = TRUE, ylim = c(0, 50))

ggplot(ind.pca_1kg, aes(x = Dim.1, y = Dim.2, colour = macro)) + 
  geom_point() + 
  scale_color_manual(values = cols25()) + 
  theme_bw() + theme(legend.position = "bottom")

ggplot(ind.pca_1kg, aes(x = Dim.1, y = Dim.2, colour = supclu)) + 
  geom_point() + 
  scale_color_manual(values = cols25()) + 
  theme_bw() + theme(legend.position = "bottom")

ggplot(ind.pca_1kg, aes(x = Dim.1, y = Dim.2, colour = cont)) + 
  geom_point() + 
  scale_color_manual(values = cols25()) + 
  theme_bw() + theme(legend.position = "bottom")

plot_ly(ind.pca_1kg, x = ~Dim.2, y = ~Dim.1, z = ~Dim.3, color = ~macro, colors = cols25(), 
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(Dim.1, 2),
                                          '</br> PC2: ', round(Dim.2, 2),
                                          '</br> PC3: ', round(Dim.3, 2)))
plot_ly(ind.pca_1kg, x = ~Dim.2, y = ~Dim.1, z = ~Dim.3, color = ~supclu, colors = cols25(), 
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(Dim.1, 2),
                                          '</br> PC2: ', round(Dim.2, 2),
                                          '</br> PC3: ', round(Dim.3, 2)))
plot_ly(ind.pca_1kg, x = ~Dim.2, y = ~Dim.1, z = ~Dim.3, color = ~cont, colors = cols25(), 
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Haplogroup: ', macro,
                                          '</br> PC1: ', round(Dim.1, 2),
                                          '</br> PC2: ', round(Dim.2, 2),
                                          '</br> PC3: ', round(Dim.3, 2)))

0.4.3 UMAP

## UMAP 
embedding_mt_2d <- ped_1kg %>% 
  select(starts_with("mt")) %>% 
  select(-all_of(miss_gt)) %>%
#  mutate_all(., .funs = function(x){ifelse(is.na(x), round(mean(x, na.rm = TRUE)), x)}) %>%
  uwot::umap(., min_dist = 0.5, n_neighbor = 50, n_components = 2, verbose = TRUE) %>% 
  as.tibble() %>%
  bind_cols(select(ped_1kg, Individual, haplogroup, macro, supclu, pop, cont)) %>%
  rename(UMAP1 = V1, UMAP2 = V2)

embedding_mt_3d <- ped_1kg %>% 
  select(starts_with("mt")) %>% 
  select(-all_of(miss_gt)) %>%
#  mutate_all(., .funs = function(x){ifelse(is.na(x), round(mean(x, na.rm = TRUE)), x)}) %>%
  uwot::umap(., min_dist = 0.5, n_neighbor = 50, n_components = 3) %>% 
  as.tibble() %>%
  bind_cols(select(ped_1kg, Individual, haplogroup, macro, supclu, pop, cont)) %>%
  rename(UMAP1 = V1, UMAP2 = V2, UMAP3 = V3)

## PCA+UMAP
embedding_pca_2d <- select(ind.pca_1kg, starts_with("Dim")) %>% 
  uwot::umap(., min_dist = 0.5, n_neighbor = 50, n_components = 2, verbose = TRUE, init = "spectral") %>% 
  as.tibble() %>%
  bind_cols(select(ped_1kg, Individual, haplogroup, macro, supclu, pop, cont)) %>%
  rename(UMAP1 = V1, UMAP2 = V2)

embedding_pca_3d <- select(ind.pca_1kg, starts_with("Dim")) %>%
  uwot::umap(., min_dist = 0.5, spread = 1, n_neighbor = 50, n_components = 3) %>% 
  as.tibble() %>%
  bind_cols(select(ped_1kg, Individual, haplogroup, macro, supclu, pop, cont)) %>%
  rename(UMAP1 = V1, UMAP2 = V2, UMAP3 = V3)
ggpubr::ggarrange(
  ggplot(embedding_mt_2d, aes(x = UMAP1, y = UMAP2, color = macro)) + 
    geom_point() + scale_color_manual(values = cols25()) +  theme_bw() + labs(title = "UMAP"),
  ggplot(embedding_pca_2d, aes(x = UMAP1, y = UMAP2, color = macro)) + 
    geom_point() + scale_color_manual(values = cols25()) +  theme_bw() + labs(title = "PCA+UMAP"), 
  common.legend = TRUE, legend = "bottom"
)

ggpubr::ggarrange(
  ggplot(embedding_pca_2d, aes(x = UMAP1, y = UMAP2, color = cont)) + 
    geom_point() + scale_color_manual(values = cols25()) +  theme_bw() + labs(title = "PCA+UMAP: Continent"),
  ggplot(embedding_pca_2d, aes(x = UMAP1, y = UMAP2, color = macro)) + 
    geom_point() + scale_color_manual(values = cols25()) +  theme_bw() + labs(title = "PCA+UMAP: mtHg"), 
  ggplot(embedding_pca_2d, aes(x = UMAP1, y = UMAP2, color = supclu)) + 
    geom_point() + scale_color_manual(values = cols25()) +  theme_bw() + labs(title = "PCA+UMAP: mtSuperCluster")

)

plot_ly(embedding_mt_3d, x = ~UMAP2, y = ~UMAP1, z = ~UMAP3, color = ~macro, colors = cols25(),
        type = 'scatter3d', mode = 'markers', marker = list(size = 3),
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Macro: ', macro,
                                          '</br> Haplogroup: ', haplogroup,
                                          '</br> PC1: ', round(UMAP1, 2),
                                          '</br> PC2: ', round(UMAP2, 2),
                                          '</br> PC3: ', round(UMAP3, 2))) %>%  
        layout(title="UMAP")
plot_ly(embedding_pca_3d, x = ~UMAP2, y = ~UMAP1, z = ~UMAP3, color = ~macro, colors = cols25(), 
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Macro: ', macro,
                                          '</br> Haplogroup: ', haplogroup,
                                          '</br> Pop: ', pop,
                                          '</br> Continent: ', cont,
                                          '</br> PC1: ', round(UMAP1, 2),
                                          '</br> PC2: ', round(UMAP2, 2),
                                          '</br> PC3: ', round(UMAP3, 2))) %>%  
        layout(title="PCA+UMAP")
plot_ly(embedding_pca_3d, x = ~UMAP2, y = ~UMAP1, z = ~UMAP3, color = ~cont, colors = cols25(), 
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Macro: ', macro,
                                          '</br> Haplogroup: ', haplogroup,
                                          '</br> Pop: ', pop,
                                          '</br> Continent: ', cont,
                                          '</br> PC1: ', round(UMAP1, 2),
                                          '</br> PC2: ', round(UMAP2, 2),
                                          '</br> PC3: ', round(UMAP3, 2))) %>%  
        layout(title="PCA+UMAP")
plot_ly(embedding_pca_3d, x = ~UMAP2, y = ~UMAP1, z = ~UMAP3, color = ~supclu, colors = cols25(), 
        type = 'scatter3d', mode = 'markers', marker = list(size = 3), 
        hoverinfo = 'text', text = ~paste('</br> ID: ', Individual, 
                                          '</br> Macro: ', macro,
                                          '</br> Haplogroup: ', haplogroup,
                                          '</br> Pop: ', pop,
                                          '</br> Continent: ', cont,
                                          '</br> PC1: ', round(UMAP1, 2),
                                          '</br> PC2: ', round(UMAP2, 2),
                                          '</br> PC3: ', round(UMAP3, 2))) %>%  
        layout(title="PCA+UMAP")

0.4.3.1 Hypurr-ameters

### UMAP Function 
umap_helper <- function(data, pcs, min_dist, n_neighbor, n_components){
  pb$tick()$print() ## Progress bar
  out <- select(data, Dim.1:glue("Dim.", pcs)) %>% 
    uwot::umap(., min_dist = min_dist, spread = 1, n_neighbor = n_neighbor, n_components = n_components) %>% 
    as.tibble() %>%
    bind_cols(select(data, Individual, haplogroup, macro, pop, supclu, cont)) %>%
    rename(UMAP1 = V1, UMAP2 = V2)  
  out
}

### Hyper-parameters to iterative over 
umap_param <- tibble(pcs = 2:10, 
                     min_dist = rep(0.5, length(pcs)), 
                     n_neighbor = rep(50, length(pcs)), 
                     n_components = rep(2,3, length(pcs))
)

umap_param <- expand_grid(
  pcs = 2:10, 
  n_components = 2, 
  min_dist = c(0.1, 0.25, 0.5, 0.75, 0.99), 
  n_neighbor = c(15, 25, 50, 75, 100)
)

pb <- progress_estimated(nrow(umap_param)) ## initiate Progress bar
hp_1kg_embedding <- umap_param %>% 
  rowwise() %>% 
  mutate(res = list(umap_helper(ind.pca_1kg, pcs = pcs, min_dist = min_dist, 
                                n_neighbor = n_neighbor, n_components = n_components)))
hp_1kg_embedding_plots <- hp_1kg_embedding %>% 
  mutate(plots = list(
    ggplot(res, aes(x = UMAP1, y = UMAP2, color = macro)) +
      geom_point() + scale_color_manual(values = cols25()) + theme_bw()
    ))

## PC Plots (n_components == 2, min_dist == 0.5, n_neighbor == 50)
ggpubr::ggarrange(plotlist =  filter(hp_1kg_embedding_plots, 
                                     pcs %in% c(2:10), 
                                     n_components == 2, 
                                     min_dist == 0.1, 
                                     n_neighbor == 15)  %>% pull(plots), 
                  legend = "bottom", 
                  common.legend = TRUE, 
                  labels = glue("PCs: {npc}", npc = 2:10))

## n_neigbor (pcs == 10, n_components == 2, min_dist == 0.5)
ggpubr::ggarrange(plotlist =  filter(hp_1kg_embedding_plots, 
                                     pcs == 10, 
                                     n_components == 2, 
                                     min_dist == 0.1, 
                                     n_neighbor %in% c(10, 25, 50, 75, 100)) %>% 
                    pull(plots), 
                  legend = "bottom", 
                  common.legend = TRUE, 
                  labels = glue("NN: {npc}", npc = c(10, 25, 50, 75, 100)))

## m_dist (pcs == 10, n_components == 2, n_neighbor == 50)
ggpubr::ggarrange(plotlist = filter(hp_1kg_embedding_plots, 
                                    pcs == 10, 
                                     n_components == 2, 
                                     min_dist %in% c(0.1, 0.25, 0.5, 0.75, 0.99), 
                                     n_neighbor == 15) %>% 
                    pull(plots), 
                  legend = "bottom", 
                  common.legend = TRUE, 
                  labels = glue("MD: {npc}", npc = c(0.1, 0.25, 0.5, 0.75, 0.99)))

0.5 Projection

umap_pca_2d <- select(ind.pca_1kg, starts_with("Dim")) %>% 
  uwot::umap(., min_dist = 0.5, n_neighbor = 50, n_components = 2) %>% 
  as.tibble() %>% 
  rename(UMAP1 = V1, UMAP2 = V2) %>%
  bind_cols(ind.pca_1kg) %>%
  select(Individual, haplogroup, macro, supclu, pop, cont, starts_with("Dim"), starts_with("UMAP"))

ggpubr::ggarrange(
  ggplot(embedding_pca_2d, aes(x = UMAP1, y = UMAP2, color = macro)) + 
    geom_point() + scale_color_manual(values = cols25()) +  theme_bw() + labs(title = "umapr::umap"),
  ggplot(umap_pca_2d, aes(x = UMAP1, y = UMAP2, color = macro)) + 
    geom_point() + scale_color_manual(values = cols25()) +  theme_bw() + labs(title = "uwot::umap")
)

ped_joint <- bind_rows(
  ped %>%
    mutate(data = "mtRef") %>%
    select(Individual, haplogroup, macro, supclu, data, starts_with("mt")), 
  
  ped_1kg %>%
    mutate(data = "tkg") %>%
    select(Individual, haplogroup, macro, supclu, data, starts_with("mt"))
) %>% 
  arrange(data, Individual)


miss_gt_joint <- ped_joint %>%
  select(starts_with("mt")) %>%  # replace to your needs
  summarise_all( ~ sum(is.na(.))) %>% 
  t() %>% 
  as_tibble(rownames = "var") %>% 
  filter(V1 > 1000) %>% pull(var)

## Join DF
ped_joint_df <- ped_joint %>% 
  select(-all_of(miss_gt_joint)) %>% 
  split(.$haplogroup) %>%
  map(., ~mutate_all(., .funs = function(x){ifelse(is.na(x), round(mean(x, na.rm = TRUE)), x)})) %>%
  bind_rows() %>% 
  arrange(data, Individual) %>%
  split(.$data) %>% 
  map(., select, starts_with("mt")) 
## PCA 
res.pca_joint <- ped_joint_df %>%
  map(~PCA(., scale.unit = F, ncp = 10, graph = FALSE))

## mtRef Predicted PCA 
res.pca_mtref_pred <- predict(magrittr::extract2(res.pca_joint, "tkg"), magrittr::extract2(ped_joint_df, "mtRef")) %>% 
  magrittr::use_series(coord) %>% 
  as_tibble() %>% 
  bind_cols(select(filter(ped_joint, data == "mtRef"), Individual, haplogroup, macro, supclu, data)) %>% 
  mutate(data = "mtRef_pred") %>% 
  select(Individual, data, haplogroup, macro, supclu, everything()) 

## 1kg Predicted PCA     
res.pca_tkg_pred <- predict(magrittr::extract2(res.pca_joint, "mtRef"), magrittr::extract2(ped_joint_df, "tkg")) %>% 
  magrittr::use_series(coord) %>% 
  as_tibble() %>% 
  bind_cols(select(filter(ped_joint, data == "tkg"), Individual, haplogroup, macro, supclu, data)) %>% 
  mutate(data = "tkg_pred") %>% 
  select(Individual, data, haplogroup, macro, supclu, everything()) 

## Join observed and predicted dataframes
ind.pca_joint <- res.pca_joint %>% 
  map(get_pca_ind) %>% 
  map(magrittr::use_series, coord) %>% 
  map(as_tibble) %>%
  bind_rows() %>%
  bind_cols(select(ped_joint, Individual, haplogroup, macro, supclu, data)) %>% 
  select(Individual, data, haplogroup, macro, supclu, everything()) %>% 
  bind_rows(res.pca_mtref_pred) %>% # mtRef Predicted
  bind_rows(res.pca_tkg_pred) %>% # tkg Predicted
  split(.$data)

ggpubr::ggarrange(
  ind.pca_joint %>% 
    bind_rows() %>%
    ggplot(., aes(x = Dim.1, y = Dim.2, colour = macro)) + geom_point() + facet_grid(. ~ data) + 
    scale_color_manual(values = cols25()),  

  ind.pca_joint %>% 
    bind_rows() %>%
    ggplot(., aes(x = Dim.1, y = Dim.2, colour = supclu)) + geom_point() + facet_grid(. ~ data) + 
    scale_color_manual(values = cols25())
)

# UMAP 
## Predict mtRef using tkg
tkg.umap_res = ind.pca_joint %>% 
  magrittr::use_series(tkg) %>% 
  select(starts_with("Dim")) %>%
  uwot::umap(., min_dist = 0.5, n_neighbor = 50, n_components = 2, verbose = TRUE, ret_model = TRUE)

tkg.umap_res_tab <- tkg.umap_res %>% 
  magrittr::use_series(embedding) %>% 
  as_tibble() %>% 
  bind_cols(magrittr::use_series(ind.pca_joint, tkg)) %>% 
  select(Individual, data, haplogroup, macro, supclu, UMAP1 = V1, UMAP2 = V2, everything())

p1 <- ggplot(tkg.umap_res_tab, aes(x = UMAP1, y = UMAP2, colour = macro)) + geom_point() + facet_grid(. ~ data) + 
  scale_color_manual(values = cols25())
p1 

mtRef_pred.umap_res <- uwot::umap_transform(magrittr::use_series(ind.pca_joint, mtRef_pred) %>% 
                                        select(starts_with("Dim")), 
                                        tkg.umap_res, 
                                      verbose = TRUE)

mtRef_pred.umap_tab <- mtRef_pred.umap_res %>% 
  as_tibble()  %>% 
  bind_cols(magrittr::use_series(ind.pca_joint, mtRef_pred)) %>% 
  select(Individual, data, haplogroup, macro, supclu, UMAP1 = V1, UMAP2 = V2, everything())

p2 <- ggplot(mtRef_pred.umap_tab, aes(x = UMAP1, y = UMAP2, colour = macro)) + geom_point() + facet_grid(. ~ data) + 
  scale_color_manual(values = cols25())
p2

## Predict tkg using mtRef
mtRef.umap_res = ind.pca_joint %>% 
 magrittr::use_series(mtRef) %>% 
  select(starts_with("Dim")) %>%
  uwot::umap(., min_dist = 0.5, n_neighbor = 100, n_components = 2, verbose = TRUE, ret_model = TRUE)

mtRef.umap_res_tab <- mtRef.umap_res %>% 
  magrittr::use_series(embedding) %>% 
  as_tibble() %>% 
  bind_cols(magrittr::use_series(ind.pca_joint, mtRef)) %>% 
  select(Individual, data, haplogroup, macro, UMAP1 = V1, UMAP2 = V2, everything())

p3 <- ggplot(mtRef.umap_res_tab, aes(x = UMAP1, y = UMAP2, colour = macro)) + geom_point() + facet_grid(. ~ data) + 
  scale_color_manual(values = cols25())
p3

tkg_pred.umap_res <- uwot::umap_transform(magrittr::use_series(ind.pca_joint, tkg_pred) %>% 
                                        select(starts_with("Dim")), 
                                        mtRef.umap_res, 
                                      verbose = TRUE)

tkg_pred.umap_tab <- tkg_pred.umap_res %>% 
  as_tibble()  %>% 
  bind_cols(magrittr::use_series(ind.pca_joint, tkg_pred)) %>% 
  select(Individual, data, haplogroup, macro, supclu, UMAP1 = V1, UMAP2 = V2, everything())


p4 <- ggplot(tkg_pred.umap_tab, aes(x = UMAP1, y = UMAP2, colour = macro)) + geom_point() + facet_grid(. ~ data) + 
  scale_color_manual(values = cols25())
p4

ggpubr::ggarrange(
  ggplot() + 
    geom_point(data = tkg.umap_res_tab, aes(x = UMAP1, y = UMAP2)) + 
    geom_point(data = mtRef_pred.umap_tab, aes(x = UMAP1, y = UMAP2, colour = macro)) + 
    theme_bw(),
  
  ggplot() + 
    geom_point(data = mtRef.umap_res_tab, aes(x = UMAP1, y = UMAP2)) + 
    geom_point(data = tkg_pred.umap_tab, aes(x = UMAP1, y = UMAP2, colour = macro)) + 
    theme_bw(), 
  legend = "bottom", common.legend = TRUE
)

## Joint Plot 
ggpubr::ggarrange(
  p1, p2, p3, p4,  ncol = 2, nrow = 2, legend = "bottom", common.legend = TRUE
)